import pandas as pd
import plotly.express as px
# Load the dataset
df = pd.read_csv("US_Accidents_March23.csv")
# Display the first few rows of the dataset
df.head()
| | ID | Source | Severity | Start_Time | End_Time | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | ... | Roundabout | Station | Stop | Traffic_Calming | Traffic_Signal | Turning_Loop | Sunrise_Sunset | Civil_Twilight | Nautical_Twilight | Astronomical_Twilight |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | A-1 | Source2 | 3 | 2016-02-08 05:46:00 | 2016-02-08 11:00:00 | 39.865147 | -84.058723 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Night |
| 1 | A-2 | Source2 | 2 | 2016-02-08 06:07:59 | 2016-02-08 06:37:59 | 39.928059 | -82.831184 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Night | Night | Day |
| 2 | A-3 | Source2 | 2 | 2016-02-08 06:49:27 | 2016-02-08 07:19:27 | 39.063148 | -84.032608 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Night | Night | Day | Day |
| 3 | A-4 | Source2 | 3 | 2016-02-08 07:23:34 | 2016-02-08 07:53:34 | 39.747753 | -84.205582 | NaN | NaN | 0.01 | ... | False | False | False | False | False | False | Night | Day | Day | Day |
| 4 | A-5 | Source2 | 2 | 2016-02-08 07:39:07 | 2016-02-08 08:09:07 | 39.627781 | -84.188354 | NaN | NaN | 0.01 | ... | False | False | False | False | True | False | Day | Day | Day | Day |
5 rows × 46 columns
# Data shape
df.shape
(7728394, 46)
# Data Types and Missing Values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7728394 entries, 0 to 7728393
Data columns (total 46 columns):
 #   Column                 Dtype
---  ------                 -----
 0   ID                     object
 1   Source                 object
 2   Severity               int64
 3   Start_Time             object
 4   End_Time               object
 5   Start_Lat              float64
 6   Start_Lng              float64
 7   End_Lat                float64
 8   End_Lng                float64
 9   Distance(mi)           float64
 10  Description            object
 11  Street                 object
 12  City                   object
 13  County                 object
 14  State                  object
 15  Zipcode                object
 16  Country                object
 17  Timezone               object
 18  Airport_Code           object
 19  Weather_Timestamp      object
 20  Temperature(F)         float64
 21  Wind_Chill(F)          float64
 22  Humidity(%)            float64
 23  Pressure(in)           float64
 24  Visibility(mi)         float64
 25  Wind_Direction         object
 26  Wind_Speed(mph)        float64
 27  Precipitation(in)      float64
 28  Weather_Condition      object
 29  Amenity                bool
 30  Bump                   bool
 31  Crossing               bool
 32  Give_Way               bool
 33  Junction               bool
 34  No_Exit                bool
 35  Railway                bool
 36  Roundabout             bool
 37  Station                bool
 38  Stop                   bool
 39  Traffic_Calming        bool
 40  Traffic_Signal         bool
 41  Turning_Loop           bool
 42  Sunrise_Sunset         object
 43  Civil_Twilight         object
 44  Nautical_Twilight      object
 45  Astronomical_Twilight  object
dtypes: bool(13), float64(12), int64(1), object(20)
memory usage: 2.0+ GB
# Summary Statistics
df.describe()
| | Severity | Start_Lat | Start_Lng | End_Lat | End_Lng | Distance(mi) | Temperature(F) | Wind_Chill(F) | Humidity(%) | Pressure(in) | Visibility(mi) | Wind_Speed(mph) | Precipitation(in) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 7.728394e+06 | 7.728394e+06 | 7.728394e+06 | 4.325632e+06 | 4.325632e+06 | 7.728394e+06 | 7.564541e+06 | 5.729375e+06 | 7.554250e+06 | 7.587715e+06 | 7.551296e+06 | 7.157161e+06 | 5.524808e+06 |
| mean | 2.212384e+00 | 3.620119e+01 | -9.470255e+01 | 3.626183e+01 | -9.572557e+01 | 5.618423e-01 | 6.166329e+01 | 5.825105e+01 | 6.483104e+01 | 2.953899e+01 | 9.090376e+00 | 7.685490e+00 | 8.407210e-03 |
| std | 4.875313e-01 | 5.076079e+00 | 1.739176e+01 | 5.272905e+00 | 1.810793e+01 | 1.776811e+00 | 1.901365e+01 | 2.238983e+01 | 2.282097e+01 | 1.006190e+00 | 2.688316e+00 | 5.424983e+00 | 1.102246e-01 |
| min | 1.000000e+00 | 2.455480e+01 | -1.246238e+02 | 2.456601e+01 | -1.245457e+02 | 0.000000e+00 | -8.900000e+01 | -8.900000e+01 | 1.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 |
| 25% | 2.000000e+00 | 3.339963e+01 | -1.172194e+02 | 3.346207e+01 | -1.177543e+02 | 0.000000e+00 | 4.900000e+01 | 4.300000e+01 | 4.800000e+01 | 2.937000e+01 | 1.000000e+01 | 4.600000e+00 | 0.000000e+00 |
| 50% | 2.000000e+00 | 3.582397e+01 | -8.776662e+01 | 3.618349e+01 | -8.802789e+01 | 3.000000e-02 | 6.400000e+01 | 6.200000e+01 | 6.700000e+01 | 2.986000e+01 | 1.000000e+01 | 7.000000e+00 | 0.000000e+00 |
| 75% | 2.000000e+00 | 4.008496e+01 | -8.035368e+01 | 4.017892e+01 | -8.024709e+01 | 4.640000e-01 | 7.600000e+01 | 7.500000e+01 | 8.400000e+01 | 3.003000e+01 | 1.000000e+01 | 1.040000e+01 | 0.000000e+00 |
| max | 4.000000e+00 | 4.900220e+01 | -6.711317e+01 | 4.907500e+01 | -6.710924e+01 | 4.417500e+02 | 2.070000e+02 | 2.070000e+02 | 1.000000e+02 | 5.863000e+01 | 1.400000e+02 | 1.087000e+03 | 3.647000e+01 |
# Count missing values per column before cleaning
df.isnull().sum()
ID                             0
Source                         0
Severity                       0
Start_Time                     0
End_Time                       0
Start_Lat                      0
Start_Lng                      0
End_Lat                  3402762
End_Lng                  3402762
Distance(mi)                   0
Description                    5
Street                     10869
City                         253
County                         0
State                          0
Zipcode                     1915
Country                        0
Timezone                    7808
Airport_Code               22635
Weather_Timestamp         120228
Temperature(F)            163853
Wind_Chill(F)            1999019
Humidity(%)               174144
Pressure(in)              140679
Visibility(mi)            177098
Wind_Direction            175206
Wind_Speed(mph)           571233
Precipitation(in)        2203586
Weather_Condition         173459
Amenity                        0
Bump                           0
Crossing                       0
Give_Way                       0
Junction                       0
No_Exit                        0
Railway                        0
Roundabout                     0
Station                        0
Stop                           0
Traffic_Calming                0
Traffic_Signal                 0
Turning_Loop                   0
Sunrise_Sunset             23246
Civil_Twilight             23246
Nautical_Twilight          23246
Astronomical_Twilight      23246
dtype: int64
# Categorical columns that contain missing values
cat = ['Description','Street','City','Zipcode','Timezone','Airport_Code','Weather_Timestamp',
       'Wind_Direction','Weather_Condition','Sunrise_Sunset','Civil_Twilight','Nautical_Twilight','Astronomical_Twilight']
# Replace null values with the mode (most frequent value) for categorical variables
for i in cat:
    mode_value = df[i].mode()[0]
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    df[i] = df[i].fillna(mode_value)
# Numerical columns that contain missing values
continuous = ['End_Lat','End_Lng','Temperature(F)','Wind_Chill(F)','Humidity(%)','Pressure(in)','Visibility(mi)',
              'Wind_Speed(mph)','Precipitation(in)']
# Replace null values with the column mean for numerical variables
for i in continuous:
    mean_value = df[i].mean()
    df[i] = df[i].fillna(mean_value)
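The null counts above show that End_Lat and End_Lng are missing in 3,402,762 of 7,728,394 rows (about 44%), so filling them with a single mean value replaces a large share of the column with one constant. A minimal sketch of a more cautious alternative follows, assuming a hypothetical `max_missing` threshold of 0.4 and median imputation for numeric columns. The `impute` helper and the tiny `demo` frame are illustrative only and are not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Tiny stand-in for the accidents table (values are made up for illustration).
demo = pd.DataFrame({
    "End_Lat": [39.9, np.nan, np.nan, np.nan],
    "Humidity(%)": [91.0, 100.0, np.nan, 96.0],
    "Wind_Direction": ["Calm", "SW", None, "SW"],
})

def impute(frame, max_missing=0.4):
    """Drop sparse columns, then fill numeric gaps with the median
    and non-numeric gaps with the mode. The threshold is an assumption."""
    out = frame.copy()
    missing_share = out.isnull().mean()  # fraction of nulls per column
    out = out.drop(columns=missing_share[missing_share > max_missing].index)
    for col in out.columns:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

cleaned = impute(demo)
print(cleaned)
```

The median is less sensitive than the mean to extreme readings such as the 207 °F maximum temperature in the summary statistics. Whether dropping or imputing a sparse column is appropriate depends on how the column is used later.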
# After cleaning the data
df.isnull().sum()
ID                       0
Source                   0
Severity                 0
Start_Time               0
End_Time                 0
Start_Lat                0
Start_Lng                0
End_Lat                  0
End_Lng                  0
Distance(mi)             0
Description              0
Street                   0
City                     0
County                   0
State                    0
Zipcode                  0
Country                  0
Timezone                 0
Airport_Code             0
Weather_Timestamp        0
Temperature(F)           0
Wind_Chill(F)            0
Humidity(%)              0
Pressure(in)             0
Visibility(mi)           0
Wind_Direction           0
Wind_Speed(mph)          0
Precipitation(in)        0
Weather_Condition        0
Amenity                  0
Bump                     0
Crossing                 0
Give_Way                 0
Junction                 0
No_Exit                  0
Railway                  0
Roundabout               0
Station                  0
Stop                     0
Traffic_Calming          0
Traffic_Signal           0
Turning_Loop             0
Sunrise_Sunset           0
Civil_Twilight           0
Nautical_Twilight        0
Astronomical_Twilight    0
dtype: int64
# Visualize contributing factors using Plotly
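# Name the columns explicitly so the chart does not depend on how the installed
# pandas version labels the output of value_counts().reset_index()
traffic_signal_counts = df['Traffic_Signal'].value_counts().rename_axis('Traffic Signal').reset_index(name='Count')
fig = px.bar(traffic_signal_counts,
             y='Traffic Signal', x='Count',
             orientation='h',
             title='Distribution of Traffic Signal as a Contributing Factor')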
# Show the plot
fig.show()
# Count accidents per weather condition, naming the columns explicitly
x = df['Weather_Condition'].value_counts().rename_axis('Weather Condition').reset_index(name='Count')
# Keep only weather conditions with more than 5000 accidents
x = x[x['Count'] > 5000]
# Analyze patterns related to weather
# Create a horizontal bar chart of the most frequent weather conditions using Plotly
fig = px.bar(x,
             y='Weather Condition', x='Count',
             orientation='h',
             title='Distribution of Weather Conditions')
# template='plotly_dark') # You can choose a different template if desired
# Show the plot
fig.show()
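The column names produced by `value_counts().reset_index()` depend on the pandas version: before 2.0 the labels land in a column called `index` and the counts in a column named after the original Series, while from 2.0 onward the labels keep the original name and the counts are called `count`. A small helper that names both columns explicitly avoids the difference. The sketch below uses a hypothetical `counts_frame` function and made-up weather labels.

```python
import pandas as pd

def counts_frame(series, label, count_label="Count"):
    """Return a two-column frame: category labels and their frequencies.
    Column names are set explicitly, so they do not vary with the pandas version."""
    return series.value_counts().rename_axis(label).reset_index(name=count_label)

# Made-up sample values for illustration
weather = pd.Series(["Fair", "Rain", "Fair", "Fog", "Fair", "Rain"])
table = counts_frame(weather, "Weather Condition")

# Filter on the count column by name
frequent = table[table["Count"] >= 2]
print(frequent)
```

The resulting frame can be passed to `px.bar` with `x='Count'` and `y='Weather Condition'`.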
# Visualize accident hotspots (latitude and longitude) using Plotly
fig = px.scatter(df.sample(1000), x='Start_Lng', y='Start_Lat', color='Severity',
color_continuous_scale='viridis', opacity=0.7,
labels={'Start_Lng': 'Longitude', 'Start_Lat': 'Latitude', 'Severity': 'Severity'},
title='Accident Hotspots')
# Set layout for the figure
fig.update_layout(
xaxis_title='Longitude',
yaxis_title='Latitude',
legend_title='Severity')
# Show the plot
fig.show()
# Convert Start_Time to datetime for time-based analysis
df['Start_Time'] = pd.to_datetime(df['Start_Time'])
# Extract date, hour, and day of the week from Start_Time
df['Date'] = df['Start_Time'].dt.date
df['Hour'] = df['Start_Time'].dt.hour
df['Day_of_Week'] = df['Start_Time'].dt.day_name()
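By default `pd.to_datetime` raises an error when it meets a value it cannot parse. If some rows in a large file are malformed (an assumption here, not something verified against this dataset), passing `errors='coerce'` turns them into `NaT` so they can be counted and inspected. A minimal sketch on made-up strings:

```python
import pandas as pd

# Two well-formed timestamps and one deliberately broken value (made up)
raw = pd.Series(["2016-02-08 05:46:00", "2016-02-08 17:07:59", "not a timestamp"])

# Unparseable entries become NaT instead of raising an exception
parsed = pd.to_datetime(raw, errors="coerce")

features = pd.DataFrame({
    "Hour": parsed.dt.hour,                # NaN where parsing failed
    "Day_of_Week": parsed.dt.day_name(),   # NaN where parsing failed
})
n_unparsed = int(parsed.isna().sum())
print(features)
print("rows that failed to parse:", n_unparsed)
```

Checking `n_unparsed` before plotting shows how many rows would silently drop out of the hourly and weekday charts.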
# Analyze distribution of accidents by hour of the day using Plotly
# Count accidents per hour, sorted by hour, with explicit column names
hour_counts = df['Hour'].value_counts().sort_index().rename_axis('Hour').reset_index(name='Count')
fig = px.bar(hour_counts,
             x='Hour', y='Count',
             labels={'Hour': 'Hour of the Day'},
             title='Accident Distribution by Hour of the Day')
# Rotate x-axis labels for better readability
fig.update_layout(xaxis=dict(tickangle=45))
# Show the plot
fig.show()
# Analyze distribution of accidents by day of the week using Plotly
# Count accidents per weekday with explicit column names
day_counts = df['Day_of_Week'].value_counts().rename_axis('Day_of_Week').reset_index(name='Count')
fig = px.bar(day_counts,
             x='Day_of_Week', y='Count',
             labels={'Day_of_Week': 'Day of the Week'},
             title='Accident Distribution by Day of the Week',
             category_orders={'Day_of_Week': ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']})
# Rotate x-axis labels for better readability
fig.update_layout(xaxis=dict(tickangle=45))
# Show the plot
fig.show()
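# Compute the correlation matrix over numeric and boolean columns only;
# text columns such as ID or City cannot be correlated
correlation_matrix = df.select_dtypes(include=['number', 'bool']).corr()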
# Plot the correlation matrix heatmap using Plotly
fig = px.imshow(correlation_matrix,
labels=dict(color='Correlation'),
x=correlation_matrix.columns,
y=correlation_matrix.columns,
title='Correlation Matrix')
fig.update_layout(width=12*80, height=16*80)
# Show the plot
fig.show()
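`DataFrame.corr` computes Pearson correlation by default, which measures linear association. Severity is an ordinal code from 1 to 4, so a rank-based coefficient such as Spearman's can be a useful complement. The sketch below runs both on a made-up frame; the choice of Spearman is a suggestion, not something the original analysis used.

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    "Severity": [1, 2, 2, 3, 4],
    "Visibility(mi)": [10.0, 10.0, 5.0, 2.0, 0.5],
    "Traffic_Signal": [True, False, True, False, False],
    "City": ["Dayton", "Columbus", "Dayton", "Akron", "Toledo"],
})

# Keep numeric and boolean columns, cast to float so booleans count as 0/1
numeric = demo.select_dtypes(include=["number", "bool"]).astype(float)

pearson = numeric.corr()                    # linear association
spearman = numeric.corr(method="spearman")  # rank-based association
print(spearman.round(2))
```

Either matrix can be passed to `px.imshow` in the same way as `correlation_matrix`.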
# Plot accident severity distribution by time of day using Plotly
# Count accidents per (hour, severity) pair and give the count column a name
severity_by_hour = df.groupby(['Hour', 'Severity']).size().reset_index(name='Count')
fig = px.bar(severity_by_hour,
             x='Hour', y='Count', color='Severity',
             title='Accident Severity by Hour of the Day',
             category_orders={'Severity': [1, 2, 3, 4]})
# Update layout to set the figure size
fig.update_layout(width=10*80, height=6*80) # Assuming each inch is 80 pixels
# Show the plot
fig.show()
# Plot accident severity distribution by weather condition using Plotly
severity_by_weather = df.groupby(['Weather_Condition', 'Severity']).size().reset_index(name='Count')
# Sort values in descending order by count
severity_by_weather = severity_by_weather.sort_values(by='Count', ascending=False)
# Keep only (weather condition, severity) pairs with more than 3000 accidents
x = severity_by_weather[severity_by_weather['Count'] > 3000]
# Create the bar chart
fig = px.bar(x,
             y='Weather_Condition', x='Count', color='Severity',
             labels={'Weather_Condition': 'Weather Condition'},
             title='Accident Severity by Weather Condition',
             category_orders={'Severity': [1, 2, 3, 4]})
# Update layout to set the figure size
fig.update_layout(width=12*80, height=9*80) # Assuming each inch is 80 pixels
# Show the plot
fig.show()
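Raw counts are dominated by the most common weather conditions, which makes it hard to tell whether a rarer condition has a different severity mix. One way to compare conditions on equal footing is to normalize each row so the severity shares sum to 1. A minimal sketch with `pd.crosstab` on made-up rows:

```python
import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame({
    "Weather_Condition": ["Fair", "Fair", "Fair", "Fair", "Snow", "Snow"],
    "Severity": [2, 2, 2, 3, 2, 4],
})

# normalize="index" divides each row by its total, giving the share of
# each severity level within a weather condition
share = pd.crosstab(demo["Weather_Condition"], demo["Severity"], normalize="index")
print(share)
```

Shares for conditions with very few accidents are noisy, so a minimum-count filter like the one used for the bar chart is still worth applying first.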
# Plot accident severity distribution by state using Plotly
# Count accidents per (state, severity) pair and give the count column a name
severity_by_state = df.groupby(['State', 'Severity']).size().reset_index(name='Count')
fig = px.bar(severity_by_state,
             y='State', x='Count', color='Severity',
             title='Accident Severity by State',
             category_orders={'Severity': [1, 2, 3, 4]})
fig.update_layout(width=12*80, height=16*80)
# Show the plot
fig.show()
For the time series analysis, we examine trends and patterns in accidents over time, focusing on temporal aspects such as hour of the day and month of the year.
# Grouping by hour and counting accidents
hourly_accidents = df.groupby('Hour').size().reset_index(name='Number_of_Accidents')
# Plotting the hourly trend of accidents using Plotly
fig = px.line(hourly_accidents, x='Hour', y='Number_of_Accidents',
markers=True, title='Hourly Trend of Accidents',
labels={'Hour': 'Hour of the Day', 'Number_of_Accidents': 'Number of Accidents'})
# Update layout for better readability
fig.update_layout(xaxis=dict(tickmode='linear', tick0=0, dtick=1),
width=12*80, height=6*80) # Assuming each inch is 80 pixels
# Show the plot
fig.show()
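The month-of-year view mentioned above can be built the same way as the hourly one. The sketch below counts accidents per calendar month on a few made-up timestamps and fills months with no records with zero so the x-axis always has twelve points. The `Month` column name is an assumption, since the notebook does not define it.

```python
import pandas as pd

# Made-up timestamps for illustration
start = pd.to_datetime(pd.Series([
    "2016-01-05 08:00:00",
    "2016-01-20 17:30:00",
    "2016-07-04 12:00:00",
    "2017-01-09 07:15:00",
]))
demo = pd.DataFrame({"Start_Time": start})
demo["Month"] = demo["Start_Time"].dt.month  # hypothetical column name

# Count per month, then make sure all 12 months are present
monthly = (
    demo.groupby("Month").size()
        .reindex(range(1, 13), fill_value=0)
        .rename_axis("Month")
        .reset_index(name="Number_of_Accidents")
)
print(monthly)
```

This pools all years together. The first rows shown earlier date from February 2016, so if the data does not cover complete calendar years, some months are counted in more years than others and the pooled totals should be read with that in mind.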
# Selecting relevant columns for multivariate analysis
selected_columns = ['Weather_Condition', 'Visibility(mi)', 'Severity']
# Create a subset with the selected columns and drop rows with missing values;
# dropna() returns a new DataFrame, which avoids SettingWithCopyWarning
subset_df = df[selected_columns].dropna()
# Creating a heatmap to explore relationships using Plotly
heatmap_data = subset_df.groupby(['Weather_Condition', 'Visibility(mi)'])['Severity'].mean().reset_index()
heatmap_matrix = heatmap_data.pivot(index='Weather_Condition', columns='Visibility(mi)', values='Severity')
# Take the axis labels from the pivoted table so they match its sorted rows and columns
fig = px.imshow(heatmap_matrix,
                labels=dict(color='Severity'),
                x=heatmap_matrix.columns,
                y=heatmap_matrix.index,
                title='Relationships between Weather, Visibility, and Severity')
# Set layout for better readability
fig.update_layout(
xaxis_title='Visibility (mi)',
yaxis_title='Weather Condition')
# Update layout to set the figure size
fig.update_layout(width=12*80, height=19*80) # Assuming each inch is 80 pixels
# Show the plot
fig.show()
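Pivoting on raw visibility values creates one heatmap column per distinct reading, which tends to produce a wide and mostly empty grid. Grouping visibility into a few ranges first gives denser cells. The sketch below uses `pd.cut` with hypothetical bin edges (0, 1, 5 and 10 miles) on made-up rows. The edges are an assumption and should be adapted to the data.

```python
import pandas as pd

# Made-up rows for illustration
demo = pd.DataFrame({
    "Weather_Condition": ["Fair", "Fair", "Fog", "Fog", "Rain", "Rain"],
    "Visibility(mi)": [10.0, 9.0, 0.25, 0.5, 3.0, 6.0],
    "Severity": [2, 2, 3, 4, 2, 3],
})

# Hypothetical bin edges; readings above 10 miles would fall outside and become NaN
bins = [0, 1, 5, 10]
labels = ["0-1", "1-5", "5-10"]
demo["Visibility_Bin"] = pd.cut(demo["Visibility(mi)"], bins=bins,
                                labels=labels, include_lowest=True)

# Mean severity for each (weather condition, visibility range) cell
grid = demo.pivot_table(index="Weather_Condition", columns="Visibility_Bin",
                        values="Severity", aggfunc="mean", observed=False)
print(grid)
```

The resulting `grid` can be passed directly to `px.imshow`, which reads the row and column labels from the DataFrame.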
Based on the analysis of the US Accidents dataset, several insights and patterns have been identified:

Severity Analysis:
Accidents are distributed across different severity levels, with Severity 2 being the most common. Severity levels vary based on factors such as weather conditions, road conditions, and visibility.

Temporal Patterns:
Accidents have distinct patterns throughout the day, with higher frequencies during peak hours, particularly in the afternoon. Monthly analysis shows variations, with an increase in accidents during the winter months.

Geospatial Analysis:
Accidents are distributed unevenly across states, indicating higher accident rates in certain regions.

Weather and Road Conditions:
Adverse weather conditions like rain and snow are associated with higher accident severity. Poor road conditions are linked to increased accident severity.

Contributing Factors:
Traffic signals and crossings play a role in accident severity, with accidents near crossings often being severe.

Multivariate Analysis:
Multivariate analysis indicates that specific combinations of weather conditions, road conditions, and visibility affect accident severity.

Time Series Analysis:
Accidents exhibit an hourly trend, peaking during certain hours of the day. There are variations in the number of accidents based on the month, suggesting seasonality.

These insights can be used to enhance road safety measures, optimize traffic management, and develop targeted interventions to reduce the severity and frequency of accidents.